Adding processors can speed up a hard-working server, as can advanced disk and connection strategies. Cut the flab from your network setup and see streamlined performance.
BY: John D. Ruley, Editor-at-Large
It's been said that many hands make light work. Given that premise, you might assume that adding a second (or third) processor to a hard-working network server would solve all your speed problems. But that assumption wouldn't win you any bets. We ran a battery of tests which show that alternate strategies--such as adding more RAM or increasing disk throughput and/or the processor speed of a server--often yield better results.
In our tests, we measured the performance of servers with various configurations of processor speed, number of processors, RAM size and workload as used by the fictitious "New Technology Bank of Modesto." (Modesto is a small town in north-central California; if you've seen the movie American Graffiti, you've seen Modesto.) Our "bank" proved an instructive and practical model for testing symmetric multiprocessing (SMP). (For a valuable introduction to SMP technology, see Processing Power.)
[Artwork: 20-Client Banking Simulation]
[Artwork: 50-Client Banking Simulation]
Let's suppose a group of investors has decided to exploit the latest in data processing technology to support the New Technology Bank of Modesto--a bank where all transactions are conducted using ATMs or by other electronic means. Such a system is certainly possible; our mission is to decide what kinds of hardware and software are required.
Software-wise, an advanced operating system like Windows NT Server is an obvious choice. Why? NT Server is scalable, supporting huge disk arrays (up to 408 million terabytes), large amounts of memory (up to 4GB) and SMP--including both Intel processors and reduced-instruction-set computing (RISC) technologies. Our bank may start small, but choosing NT ensures that our needs won't outgrow the OS. That foresight can save us from costly upgrades and retraining later on.
Of course, the operating system isn't the only thing we have to address. In running an all-electronic bank, our most critical operations will depend on maintaining a continuously updated set of user accounts. When customers walk up to an ATM and attempt to make withdrawals, we must be able to find out whether their accounts contain the necessary funds before committing the transaction, then debit the accounts and record the results on disk. Such operations are critical by their very nature. If we lose any account data or the database becomes corrupt, our bank is dead.
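The core logic is simple to state; what matters is that the balance check, the debit and the on-disk record happen as a single, durable unit. Here's a minimal sketch of the flow in Python--the account data and function are invented for illustration, and in practice the database engine, not application code, must guarantee the atomicity and durability:

    # A toy illustration of the check-then-debit flow behind an ATM withdrawal.
    # The balances below are made up; a real system relies on the database
    # engine to make the whole sequence atomic and durable.
    accounts = {"1001": 250.00, "1002": 75.50}

    def withdraw(account_id, amount):
        """Debit an account only if the funds are there; refuse otherwise."""
        balance = accounts.get(account_id)
        if balance is None or balance < amount:
            return False                         # insufficient funds: no change made
        accounts[account_id] = balance - amount  # debit the account
        # A real system must now record the result on disk before reporting
        # success -- losing this record would corrupt the bank's books.
        return True

    print(withdraw("1001", 100.00))   # True: balance drops to 150.00
    print(withdraw("1002", 500.00))   # False: balance is left untouched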
Fortunately, there are several excellent multi-user databases available for NT Server. The one we've selected is Microsoft's SQL Server 6.0, which fully exploits NT's scaling features.
Can we expect the combination of SQL Server and NT to deliver adequate performance for our bank? Microsoft's published benchmarks show performance in excess of 100 transactions per second (tps) from single-Pentium servers, scaling to more than 1,000tps on multiprocessors. The question, then: "Is that enough?"
Each activity a customer carries out at an ATM represents one transaction. This includes retrieving balances and making withdrawals, deposits or transfers. It is extremely rare for a customer to conduct more than a few transactions per minute, so a rate of 1tps per ATM is a very safe worst case. Since our bank will start with 20 ATMs, a system capable of delivering just 20tps should be more than enough.
Clearly, a single Pentium server will be more than enough to do the job--but with how much RAM and how much hard disk space?
Microsoft specifies 16MB as the minimum memory for both SQL Server 6.0 and NT Server 3.51, so that's where we started. As for hard drives, the capacity depends on the number of customers. Assuming the bank has 10,000 customers, each carrying out an average of one transaction per day, we might need space for 3,650,000 database records per year. If each transaction record takes just 1KB, we're looking at a 3.7GB database. Our test system doesn't actually have a disk that large (see "Testing SQL Server Scaling"), but it does have 2GB, which should be more than enough for some initial experiments.
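To make the arithmetic concrete, here's the back-of-the-envelope calculation as a short Python sketch (the customer count, transaction rate and record size are the assumptions stated above):

    # Back-of-the-envelope database sizing for the Modesto scenario.
    customers = 10000        # assumed customer base
    txns_per_day = 1         # average transactions per customer per day
    record_kb = 1            # assumed size of one transaction record, in KB

    records_per_year = customers * txns_per_day * 365
    size_gb = records_per_year * record_kb / 1000000.0   # decimal gigabytes

    print(records_per_year)  # 3650000 records
    print(size_gb)           # 3.65 -- call it 3.7GB once indexes and overhead are added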
Now all that remains is to decide how many client systems we want to run, and how to connect them with the server. In keeping with our New Technology Bank of Modesto scenario, let's start with a 20-client system, each client representing an ATM. Since we know the transaction rate for each client will be 1tps or less, we should be able to use an inexpensive, low-data-rate (38.4Kbaud) serial-port connection. Conveniently, NT Server has Remote Access Services (RAS) built in to support such connections. Running a 20-client case on our minimally configured server yields a pokey 16.9tps--below our desired 20tps minimum.
Why are we getting only 16.9tps? We know the processor is capable of better performance. Clearly, some other component is acting as a bottleneck, blocking us from achieving the sort of performance our system should easily produce.
As it happens, NT comes with an extremely powerful tool called Performance Monitor that identifies bottlenecks. It amounts to a built-in system analyzer that can be used to peek under the hood at all sorts of interesting things about both NT and its applications. So we needn't guess at what's clogging up the system--we can look. (For more on Performance Monitor, see the Windows NT column in this issue.)
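Performance Monitor can log its counter samples to a file, so even a short script can flag the likely culprits. The sketch below is purely illustrative--the log file name, the column names and the thresholds are invented--but it captures the kind of check we run:

    import csv

    # Rough bottleneck check over counter samples exported from Performance
    # Monitor. Column names and thresholds here are hypothetical stand-ins.
    LINK_BPS = 38400    # capacity of the 38.4Kbaud serial link, bits per second

    with open("perfmon_export.csv", newline="") as f:
        for sample in csv.DictReader(f):
            faults = float(sample["PageFaultsPerSec"])
            net_bps = float(sample["NetworkBitsPerSec"])
            if faults > 50:
                print("possible memory shortage:", faults, "page faults/sec")
            if net_bps > 0.25 * LINK_BPS:
                print("serial link heavily loaded:", net_bps, "of", LINK_BPS, "bps")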
Looking at the counters, we note two interesting facts. Network traffic is running at more than 10Kbaud--below our theoretical limit of 38.4Kbaud, but high enough to worry us--and page faults (the number of times NT exhausts its memory and has to swap a 4KB "page" to disk to free up additional memory) are running at 50 to 100 per second. That suggests we may be running out of memory, network bandwidth or both. To determine which, we try changing each--raising RAM to 32MB, and moving from the 38.4Kbaud serial line to 10Mb per second (Mbps) Ethernet, which is about 30 times faster. The results, shown in the 20-Client Banking Simulation graphic, are quite interesting.
As you can see, increasing the network bandwidth while leaving RAM at 16MB has little effect. Increasing RAM to 32MB while leaving the network at 38.4Kbaud is a little better, but combining the two yields an extremely impressive result--close to five times the original performance.
That's typical of the way bottlenecks work. With just 16MB of RAM, our platform is capable of no more than 17.5tps, regardless of the network bandwidth. With just 38.4Kbaud of bandwidth, we're capable of no more than 27tps, regardless of the RAM. Clearing one bottleneck exposes the other.
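In effect, overall throughput is the minimum of the individual ceilings. A toy model in Python makes the point (the ceiling values are the ones just measured; the CPU figure is a rough stand-in for the roughly fivefold result we see once both bottlenecks are gone):

    # Throughput is capped by whichever resource ceiling is lowest; clearing
    # one bottleneck merely exposes the next. Values are in tps.
    def throughput(ceilings):
        return min(ceilings.values())

    config = {"ram_16mb": 17.5, "serial_38_4k": 27.0, "cpu": 80.0}
    print(throughput(config))        # 17.5 -- memory is the bottleneck

    config["ram_16mb"] = 999.0       # upgrade RAM to 32MB
    print(throughput(config))        # 27.0 -- now the serial link caps us

    config["serial_38_4k"] = 999.0   # move to 10Mbps Ethernet as well
    print(throughput(config))        # 80.0 -- at last the processor is the limit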
When both bottlenecks are cleared, we finally begin to see the performance we expected. Of course, there are some negative implications to what we've discovered--for example, it would be significantly more expensive to run Ethernet cabling to all our ATMs--so we might decide to perform just the memory upgrade on our server, since the resulting 27tps exceeds our requirements.
That's sufficient for our present purposes--the Modesto scenario needs only 20tps. But what if we want to scale the concept up a bit?
Suppose we want to expand the "NT Bank" model to a nationwide service. It's easy to imagine 1,000 branch banks like the one we've described, each with 20 ATMs. Can the NT/SQL Server platform scale up to something that large? To answer that question, we'll try scaling from one to two processors, as shown in the graphic SMP Scalability.
[Artwork: SMP Scalability]
It's curious that although we doubled the processing power, matters didn't improve much. This looks very much like we've hit another bottleneck. To find it, we bring up Performance Monitor again. A quick check shows that network traffic is well below Ethernet's 10Mbps limit. So the bottleneck doesn't lie in the network or memory. Processor utilization is 70 percent in the first case and 37 percent in the second--so adding a second processor hasn't helped.
With memory, network and the CPU eliminated, about all that's left is disk traffic. Checking that, we find the hard disk is running at 100 percent utilization, delivering 350KBps of data. That's our bottleneck.
To relieve it, we can exploit the fact that SQL Server accesses two separate database devices: a data device and a transaction log device. Moving the latter to a separate drive helps matters, as shown here.
This alteration doesn't completely solve the problem, however. You'd expect the second processor to nearly double performance. It doesn't, so out comes Performance Monitor again. We find that with one processor, we're finally seeing 100 percent CPU utilization and getting about 360KBps in disk activity from the two drives. The two-processor case only gives 75 percent utilization (which means the ultimate limit on a two-CPU system is in the 400tps range), but we're getting 100 percent disk utilization at just 600KBps. In other words, we've maxed out the disks again. Installing additional disks (probably as part of a disk array such as RAID) could relieve the bottleneck, getting us into the 400tps range--but we've now started to see the true limit on server performance.
When we relieved the disk bottleneck on our 90MHz Pentium-based server, we measured performance of almost 250tps. This suggests that adding a second processor would give us nearly 500tps. It does not--it produces just 400tps under ideal conditions. Of course, we deliberately kept a consistent heavy workload on the server--a far greater load than you would see in real-life conditions due to lulls and lags in real activity. Why? Because of a phenomenon called "contention" that's governed by Amdahl's Law, which basically says there's no such thing as a free lunch.
When you try to spread a single task or series of tasks over several CPUs, performance is governed by how frequently the processors contend for a shared resource: RAM, or access to disk drives, or even the network card. In each case, the problem is that both processors cannot access the same device at the same time, so each processor you add gives less benefit than its predecessor.
Under ideal conditions, adding the second processor to our system would improve performance by around 60 percent. If that trend continues, a third CPU would add no more than 60 percent of the performance gained by adding the second, and so on.
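One simple way to picture that trend: assume each new processor contributes only 60 percent of the gain the previous one added, and sum the series. This is a simplification, not a measurement, but it shows how quickly the curve flattens:

    # Diminishing returns from added CPUs: each new processor contributes only
    # a fixed fraction (here 0.6) of the gain the previous processor added.
    def relative_performance(cpus, scaling=0.6):
        return sum(scaling ** k for k in range(cpus))

    print([round(relative_performance(n), 2) for n in range(1, 9)])
    # [1.0, 1.6, 1.96, 2.18, 2.31, 2.38, 2.43, 2.46]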
There's a dramatically reduced bang for the buck in adding more than about four processors. Moreover, we have to consider the possibility of other bottlenecks. As we've seen, getting much more than 150tps required adding a second disk drive, which in turn limited us to 300tps--so any 600tps figures would require at least a four-drive disk array. We would also probably have to relieve a network bandwidth bottleneck to get real-world results this high.
Whether a particular application will scale, and how well, depends on details of both the application design and the hardware platform. Reconfiguring our database to use a memory-based temporary transaction store, plus increasing secondary cache memory in the server, might give us 80 percent scaling, but so long as transactions cannot be performed entirely in memory, you're not going to do much better than what we show here.
Think about that nationwide New Technology bank. We determined that a branch bank with 20 stations needed a 20tps server. But how many of those branches might our bank have? It doesn't matter--achieving more than a few hundred tps is all but impossible with a single server, no matter how many processors are thrown in.
But why does our nationwide NT bank need to handle all transactions on a single server? Real banks don't work that way. Instead, local transactions are recorded at the branch bank, which reports a summary of activity to the national headquarters daily--that's why transactions recorded after 3 p.m. are processed the next business day.
What happens when a customer tries to process a transaction out of town? If each branch operates independently, you've got a real problem. Either you can't handle out-of-town transactions--in which case, the customer will probably decide our bank is not worth using--or you run the risk of having multiple copies of the customer's account, which can get out of sync.
To get around this, the branch bank needs to execute a two-step process: If the customer's account is local, the transaction is processed directly. If the account isn't local, the branch copies the customer's account from his home branch. The system then processes the transaction and returns updated information to the home branch. This depends on having some mechanism to connect all the branches--which itself may become a bottleneck.
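Expressed as code, the branch's decision amounts to a few lines. The Python sketch below is a deliberately simplified illustration--the class and its methods are invented, and a real system would wrap the copy-process-return sequence in its own distributed transaction:

    # Simplified two-step branch logic: handle local accounts directly; for
    # out-of-town accounts, copy the record from the home branch, apply the
    # transaction, then return the updated record to the home branch.
    class Branch:
        def __init__(self, name):
            self.name = name
            self.accounts = {}                  # account_id -> balance

        def process(self, account_id, amount, home_branch=None):
            if account_id in self.accounts:     # local account: process it here
                self.accounts[account_id] += amount
                return self.accounts[account_id]
            balance = home_branch.accounts[account_id]   # copy from home branch
            balance += amount                            # process the transaction
            home_branch.accounts[account_id] = balance   # return the update
            return balance

    modesto = Branch("Modesto")
    modesto.accounts["1001"] = 250.00
    fresno = Branch("Fresno")
    print(fresno.process("1001", -40.00, home_branch=modesto))   # 210.0, via Modesto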
Imagine a company that tries to use a small number of SMP-based regional servers instead of a larger number of branch office machines. The thinking might be that doing so will improve performance, but in all probability this would be limited by network traffic, with thousands of customers trying to access each server. The bottleneck, like the card in a magic trick, may move, but it never disappears entirely.
In any case, this approach is how database operations are scaled up to the enterprise level in the real world, with databases distributed among multiple servers, summary transactions rolled up to headquarters in a periodic report, and database servers communicating among themselves. SMP is just one tool for optimizing their performance; it's not a one-step solution.
A listing of vendors of scalable servers is available in Winmag Online Locations.
Multiprocessor capability is both useful and effective when put to work in a large network environment--but what about smaller setups where only basic shared file and print services are needed? To find out, we experimented with workload simulations on the test networks described in the main part of this feature. The simulations involved a varying number of client programs, each of which would wait for a random "think" interval and then perform a file activity--repeating the process as often as desired.
With up to 50 clients on a very active schedule ("thinking" for 10 seconds, then writing a file averaging 16KB), CPU utilization on a single Pentium-60 never exceeded 25 percent. Further examination showed that disk I/O was a bottleneck, leaving the Ethernet LAN running at half its maximum throughput. With a better disk, we might saturate the Ethernet with 100 clients, but even then we'd be using only 50 percent of a Pentium-60, so for small LANs there is no benefit to adding more CPUs.
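For reference, each simulated client boils down to a simple loop: think, write a file, repeat. The Python sketch below is an illustration of that workload with the parameters from the 50-client run, not the harness we actually used:

    import os, random, tempfile, time

    # One simulated file-service client: pause for a random "think" interval,
    # then write a file averaging about 16KB, and repeat.
    def client(iterations, max_think_seconds=10, avg_file_kb=16):
        for i in range(iterations):
            time.sleep(random.uniform(0, max_think_seconds))   # "think" time
            size = int(random.gauss(avg_file_kb, 4)) * 1024    # roughly 16KB
            path = os.path.join(tempfile.gettempdir(), "client_%d.dat" % i)
            with open(path, "wb") as f:
                f.write(os.urandom(max(size, 1024)))           # the file activity

    client(iterations=3, max_think_seconds=1)   # shortened demonstration run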
Our first reaction is to say there's no benefit to using SMP in a file- and print-service environment. While that may be true for small LANs, big ones are a different story.
With a high-performance disk setup, we can probably saturate 10Mbps Ethernet with 100 clients, at a CPU utilization of 50 percent. What if we replace the Ethernet with FDDI, which has a bandwidth of 100Mbps? We'll probably need to boost disk bandwidth again. Assuming that we can solve the disk bottleneck, we'll get up to around twice the normal Ethernet bandwidth at 200 clients, at which point we will saturate the CPU. Adding another CPU should relieve this bottleneck, and--provided we don't saturate the disk array--we can support 400 clients at 40Mbps of network bandwidth.
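The projection follows directly from the ratios in the Ethernet case: 100 clients generating roughly 10Mbps of traffic and consuming about 50 percent of one CPU. A quick sketch of the arithmetic (an extrapolation, not a measurement):

    # Extrapolating from the measured Ethernet case: 100 clients produce about
    # 10Mbps of traffic and keep a single CPU roughly 50 percent busy.
    mbps_per_client = 10.0 / 100   # network traffic per client
    cpu_per_client = 0.50 / 100    # fraction of one CPU per client

    one_cpu_limit = 1.0 / cpu_per_client    # 200 clients saturate one CPU
    print(one_cpu_limit, one_cpu_limit * mbps_per_client)     # 200 clients, 20Mbps

    two_cpu_limit = 2.0 / cpu_per_client    # a second CPU doubles the ceiling
    print(two_cpu_limit, two_cpu_limit * mbps_per_client)     # 400 clients, 40Mbps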
As with our SQL Server tests, assuming that file and print services will scale perfectly is probably naive. Determining whether further SMP scaling follows the 80 percent or 60 percent contour probably depends in part on how much memory the server has. NT's disk subsystem uses excess RAM as an active disk cache, which helps offload the hard disk and can greatly improve performance.
Assuming 80 percent scaling, eight processors should give about five times the performance of a single CPU, allowing a worst-case scenario of 250 extremely active clients using more than 50 percent of the available 100Mbps network bandwidth. Moving to newer Pentium processors would boost this further, eating into the network bandwidth and approaching a theoretical limit of 500 clients.
Again, remember that our workload simulation assumes an unrealistically active client--one that "thinks" for 10 seconds or less before performing any disk activity. Real-world network users don't think that fast--each has a "think" time that's measured in minutes. That means our 500-client maximum workload case probably translates to several thousand users in the real world.
Testing SMP performance is a challenging task. To do it, we used two server platforms: an NCR 3360 with 32MB of RAM and an AT&T GIS Globalyst with 128MB of RAM. Both servers use two Pentium processors, the former clocked at 60MHz, the latter at 90MHz.
When needed, we limited Windows NT to using only one processor by adding an undocumented switch (/MAXNUMPROC=1) to the BOOT.INI file. A similar /MAXMEM=16 switch was used to limit the server to 16MB of RAM.
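In practice, that means appending the switches to the server's entry in BOOT.INI, along these lines (the ARC path shown is illustrative; yours will differ):

    [operating systems]
    multi(0)disk(0)rdisk(0)partition(1)\WINNT="Windows NT Server 3.51" /MAXNUMPROC=1 /MAXMEM=16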
We tested the servers using Microsoft's TPC-B benchmarking kit, which is designed to produce benchmark results applicable to the Transaction Processing Performance Council's "B" specification. We are not claiming full "B" results, however, as we performed only in-memory tests (scaled results would have been approximately 20 percent lower) and did not conduct the necessary cost analysis.
As a result, our tests should not be compared to full TPC-B tests, and are provided for instructional purposes only.
We executed multiple copies of the NT-based client software on client systems (an NCR 486/33 for the 60MHz server, a Gateway Pentium-90 for the Globalyst) connected to the servers on isolated 10BaseT Ethernet segments. We executed NT's built-in Performance Monitor application on both client and server systems to determine processor loading, memory utilization and other performance factors.